Robust classification using average correlations as features (ACF)

Motivation In single-cell transcriptomics and other omics technologies, large fractions of missing values commonly occur. Researchers often either consider only those features that were measured for every instance of their dataset, thereby accepting a severe loss of information, or use imputation, which can lead to erroneous results. Pairwise metrics allow for imputation-free classification with minimal loss of data.
Results Using pairwise correlations as the metric, state-of-the-art approaches to classification include the K-nearest-neighbor (KNN) classifier and the distribution-based classification (DBC) classifier. Our novel method, termed average correlations as features (ACF), significantly outperforms those approaches by training tunable machine learning models on inter-class and intra-class correlations. We characterize our approach in simulation studies and demonstrate its classification performance on real-world datasets from single-cell RNA sequencing and bottom-up proteomics. Furthermore, we demonstrate that variants of our method offer superior flexibility and performance over KNN classifiers and can be used in conjunction with other machine learning methods. In summary, ACF is a flexible method that enables missing-value-tolerant classification with minimal loss of data.
Supplementary Information The online version contains supplementary material available at 10.1186/s12859-023-05224-0.
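The idea of using average correlations as features can be illustrated with a minimal sketch. It assumes Pearson correlation computed over the features observed in both samples, and one feature per class equal to the mean correlation with the training samples of that class. The function names, the `min_overlap` threshold and the toy data are hypothetical and not taken from the published implementation.

```python
import numpy as np

def pairwise_corr(a, b, min_overlap=3):
    """Pearson correlation over the features observed (non-NaN) in both vectors."""
    both = ~np.isnan(a) & ~np.isnan(b)
    if both.sum() < min_overlap:  # assumed threshold: too few shared features
        return np.nan
    x = a[both] - a[both].mean()
    y = b[both] - b[both].mean()
    denom = np.sqrt((x ** 2).sum() * (y ** 2).sum())
    return (x * y).sum() / denom if denom > 0 else np.nan

def acf_features(X, X_train, y_train):
    """Mean correlation of each row of X with the training samples of each class."""
    classes = np.unique(y_train)
    feats = np.full((X.shape[0], classes.size), np.nan)
    for i, row in enumerate(X):
        corrs = np.array([pairwise_corr(row, t) for t in X_train])
        for j, c in enumerate(classes):
            vals = corrs[y_train == c]
            if np.any(~np.isnan(vals)):
                feats[i, j] = np.nanmean(vals)
    return classes, feats

# Toy data with missing values: class 0 increases, class 1 decreases.
X_train = np.array([
    [1.0, 2.0, 3.0, 4.0, np.nan],
    [2.0, 4.0, 6.0, np.nan, 10.0],
    [5.0, 4.0, 3.0, 2.0, 1.0],
    [4.0, np.nan, 2.0, 1.0, 0.0],
])
y_train = np.array([0, 0, 1, 1])
X_new = np.array([[np.nan, 1.0, 2.0, 3.0, 4.0]])

classes, feats = acf_features(X_new, X_train, y_train)
print(classes, feats)
```

The resulting feature matrix has one column per class and could be passed to any standard classifier. No value is imputed at any point: each correlation uses only the features shared by the two samples involved.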


Figure 2: Comparison of F-ACF and F-DBC without hyperparameter optimization (left) and with hyperparameter optimization (right). On 250 generated datasets, the win rate (higher F1-scores) of F-ACF over F-DBC increases from 65.33% to 86.67% when employing hyperparameter optimization.
All datasets considered in this study have been made publicly available by the authors of the corresponding studies. This section describes general properties associated with the data.

10X Genomics
This dataset consists of 2700 peripheral blood mononuclear cells (PBMCs) from a healthy donor and was published by 10X Genomics on 26th of May, 2016 under the Creative Commons Attribution license. We followed the Guided Clustering Tutorial of Seurat (https://satijalab.org/seurat/articles/pbmc3k_tutorial.html, last visited on 23rd of February, 2022) to assign classes to the cells.
We used Seurat V4.0.2 to eliminate cells with more than 2500 or fewer than 200 unique feature counts, as well as cells with a mitochondrial contamination of ≥5%. This left us with 2638 cells and a total of 32738 unique features.
We selected the 2000 most variable features, before scaling the data and performing a PCA. The first 10 principal components were selected for further analysis.
Using a KNN-graph-based clustering approach, Seurat assigns each of the cells to one of 9 clusters with the following class distribution.

We retrieved the dataset using the R library scRNAseq by Risso and Cole (2021), V2.4.0. We set zeros to NaN and exported the UMI counts. The resulting dataset consisted of 8569 cells from 14 classes and with 20125 features.
To enable stratified subsampling of the dataset, we excluded classes that make up less than 1% of the total number of cells. Then, we randomly selected a stratified subset of 40% of the remaining cells.
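The two subsampling steps can be sketched as follows. This is an illustration under assumptions, not the code used in the study: the function name `stratified_subset`, the rounding of the per-class sample size and the random seed are all hypothetical.

```python
import numpy as np

def stratified_subset(labels, fraction=0.4, min_share=0.01, seed=0):
    """Return sorted indices of a stratified random subset.

    Classes with a share below `min_share` are dropped first; from each
    remaining class, round(fraction * class_size) cells are drawn.
    """
    labels = np.asarray(labels)
    rng = np.random.default_rng(seed)
    classes, counts = np.unique(labels, return_counts=True)
    kept_classes = classes[counts / labels.size >= min_share]
    chosen = []
    for c in kept_classes:
        idx = np.flatnonzero(labels == c)
        n = int(round(fraction * idx.size))
        chosen.append(rng.choice(idx, size=n, replace=False))
    return np.sort(np.concatenate(chosen))

# Toy labels: class "c" makes up 0.5% of the cells and is therefore excluded.
labels = ["a"] * 500 + ["b"] * 495 + ["c"] * 5
idx = stratified_subset(labels)
print(idx.size)
```

Drawing the same fraction from every class keeps the class proportions of the subset close to those of the filtered dataset.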
The final dataset consisted of 3380 cells with all 20125 genes and the class distribution summarized in Table 2.

We retrieved the dataset using the R library scRNAseq by Risso and Cole (2021), V2.4.0. We set zeros to NaN, excluded all contaminated cells and exported the RPKM values. The resulting dataset consisted of 1492 cells and 39851 genes.
The resulting class distribution is summarized in Table 3.

Petralia et al.
This dataset was published by Petralia et al. in their 2020 publication "Integrated Proteogenomic Characterization across Major Histological Types of Pediatric Brain Cancer". It contains 218 samples of pediatric brain tumors that were analyzed using MS3 and liquid chromatography. The raw proteomics data and processed proteogenomics data are publicly available at https://cptac-data-portal.georgetown.edu/cptacPublic/. Raw data were processed with MaxQuant, and 9155 proteins were quantified in total.
The authors of the original publication assign 8 proteomic subtypes with the class distribution summarized in Table 4.

Krug et al.
This dataset was published by Krug et al. in their 2020 publication "Proteogenomic Landscape of Breast Cancer Tumorigenesis and Targeted Therapy". In that study, 125 primary, treatment-naive breast cancers were analyzed using LC-MS/MS. Both raw and characterized data are publicly available at https://cptac-data-portal.georgetown.edu/study-summary/S060. We obtained the processed dataset directly from the authors of the study. In total, 13769 proteins were quantified.
The targeted class in our study is the PAM50 type. We excluded tumors of the "Normal" type, since only 6 samples of this type were measured. The resulting class distribution is summarized in Table 5.

Considered Model for Batch Effects
We model the batch effect associated with the two proteomic datasets as depicted in Figure 2. Our model assumes that the only correlations affected by the batch effect are the correlations between samples from the same batch. All other correlations remain unaffected.
Those correlations can be masked using B-ACF.
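One way to picture this masking is the sketch below. It assumes that a precomputed sample-by-training-sample correlation matrix is available and that masking means setting same-batch correlations to NaN before averaging per class. The function name and the toy values are hypothetical, and the actual B-ACF implementation may differ.

```python
import numpy as np

def masked_class_averages(corr, train_labels, sample_batches, train_batches):
    """Average correlations per class, ignoring pairs measured in the same batch."""
    corr = np.array(corr, dtype=float)  # work on a copy
    same_batch = (np.asarray(sample_batches)[:, None]
                  == np.asarray(train_batches)[None, :])
    corr[same_batch] = np.nan  # mask correlations affected by the batch effect
    train_labels = np.asarray(train_labels)
    classes = np.unique(train_labels)
    feats = np.full((corr.shape[0], classes.size), np.nan)
    for j, c in enumerate(classes):
        block = corr[:, train_labels == c]
        counts = (~np.isnan(block)).sum(axis=1)
        sums = np.nansum(block, axis=1)
        usable = counts > 0  # rows with at least one unmasked correlation
        feats[usable, j] = sums[usable] / counts[usable]
    return classes, feats

# Toy example: two samples, four training samples from two batches.
corr = [[0.9, 0.5, 0.8, 0.1],
        [0.4, 0.95, 0.2, 0.7]]
train_labels = [0, 0, 1, 1]
train_batches = ["b1", "b2", "b1", "b2"]
sample_batches = ["b1", "b2"]

classes, feats = masked_class_averages(corr, train_labels,
                                       sample_batches, train_batches)
print(feats)
```

In the toy example the high same-batch correlations (0.9, 0.8, 0.95, 0.7) are excluded, so each class average rests only on cross-batch pairs.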

Results of the Individual Classifiers on the Biologic Datasets
For transparency, we report the individual F1 scores of the respective classifiers on the datasets from scRNA-seq (cf. Table 6) and multiplexed proteomics (cf. Table 7).
ACF+RF, ACF+Ridge and ACF+SVC indicate our ACF method with a RandomForest, Ridge-Classifier and Support-Vector-Classifier as baseline classifier, respectively.
The scores reported for RF, Ridge and SVC reflect the performance of the indicated classifier based on listwise deletion.
Furthermore, we provide the classification report for DBC on the dataset by Xin et al. (cf. Table 8).
The results are based on 5-fold, stratified cross validation.
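The evaluation scheme can be sketched as follows. The fold assignment and the macro-averaged F1 score shown here are assumptions made for illustration. The text does not state which F1 averaging was used, and the helper names are hypothetical.

```python
import numpy as np

def stratified_folds(labels, n_splits=5, seed=0):
    """Assign each sample to a fold so that every class is spread evenly."""
    labels = np.asarray(labels)
    rng = np.random.default_rng(seed)
    fold_of = np.empty(labels.size, dtype=int)
    for c in np.unique(labels):
        idx = rng.permutation(np.flatnonzero(labels == c))
        fold_of[idx] = np.arange(idx.size) % n_splits  # deal out round-robin
    return fold_of

def macro_f1(y_true, y_pred):
    """Unweighted mean of the per-class F1 scores over the classes in y_true."""
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    scores = []
    for c in np.unique(y_true):
        tp = np.sum((y_pred == c) & (y_true == c))
        fp = np.sum((y_pred == c) & (y_true != c))
        fn = np.sum((y_pred != c) & (y_true == c))
        denom = 2 * tp + fp + fn
        scores.append(2 * tp / denom if denom else 0.0)
    return float(np.mean(scores))

# Toy labels: 10 samples of class 0 and 5 samples of class 1.
labels = np.array([0] * 10 + [1] * 5)
folds = stratified_folds(labels)
for k in range(5):
    test_idx = np.flatnonzero(folds == k)
    train_idx = np.flatnonzero(folds != k)
    # A classifier would be fitted on train_idx and scored on test_idx here.
    print(k, np.bincount(labels[test_idx], minlength=2))

score = macro_f1([0, 0, 1, 1], [0, 1, 1, 1])
```

Each test fold in the toy example contains two samples of class 0 and one of class 1, mirroring the overall class ratio.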